N-1324-HEW


October  1979 


A  LOOK  AT  VARIOUS  ESTIMATORS  IN  LOGISTIC  MODELS  IN  THE  PRESENCE 
OF  MISSING  VALUES 

Winston  K.  Chow 


A  Rand  Note 

prepared  for  the 

U.S.  DEPARTMENT  OF  HEALTH,  EDUCATION,  AND  WELFARE 


Performing organization: The Rand Corporation, 1776 Main Street,
PO Box 2138, Santa Monica, CA 90407-2138

Approved for public release; distribution unlimited


The  research  reported  herein  was  performed  pursuant 
to  Grant  No.  016B-7901-P2021  from  the  U.S.  Department 
of  Health,  Education,  and  Welfare,  Washington,  D.  C. 

The  opinions  and  conclusions  expressed  herein  are  solely 
those  of  the  author,  and  should  not  be  construed  as 
representing  the  opinions  or  policy  of  any  agency  of  the 
United States Government.


The Rand Publications Series: The Report is the principal publication
documenting and transmitting Rand's major research findings and final
research results. The Rand Note reports other outputs of sponsored
research for general distribution. Publications of The Rand Corporation
do not necessarily reflect the opinions or policies of the sponsors of
Rand research.


Published  by  The  Rand  Corporation 




PREFACE 


This  Note  was  prepared  for  presentation  at  the  annual  meeting 
of  the  American  Statistical  Association,  Washington,  D.C.,  August  13- 
16,  1979.  It  reports  on  Rand  research  supported  by  a  grant  from  the 
U.S.  Department  of  Health,  Education,  and  Welfare. 

The  objective  of  this  research  is  to  present  various  methods  for 
estimating  parameters  of  logistic  regression  models  in  the  presence 
of missing values. Many of the commonly used techniques for treating
missing  values  in  multiple  regression  are  incorporated  into  the 
logistic regression framework.


SUMMARY


Two  commonly  used  procedures  for  estimating  the  parameters  of  a 
logistic  regression  function  are  the  maximum  likelihood  estimators 
and the discriminant function estimators. Comparisons of these
procedures for fitting logistic regression models based on the experience
of  many  researchers  can  be  found  in  the  literature.  The  comparisons 
become  more  complicated  when  one  or  more  values  of  the  independent 
variables  of  certain  observations  are  missing  at  random.  When  data 
are  missing,  researchers  may  not  be  willing  to  base  their  estimates 
only  on  the  subset  of  complete  cases,  particularly  if  the  size  of 
this  subset  is  relatively  small. 

In  this  paper,  six  missing-values  techniques  are  studied: 

DF1: Discriminant Function Estimation Using Complete Observations

DF2: Discriminant Function Estimation Using Existing Pairs
of Values for Correlations

DF3: Discriminant Function Estimation Adjusting for Residual
Covariances

ML1: Maximum Likelihood Estimation Using Complete Observations

ML2: Maximum Likelihood Estimation with Indicator Variables
for Missing Data

WLS: Weighted Least Squares Estimation after Linearizing
the Conditional Probability

The estimators generated by the methods DF1 and ML1 simply ignore
the observations having missing components. Method DF2 incorporates
estimated mean vectors and covariance matrices in the linear
discriminant function; the means are calculated using all available data,
but correlations are computed using only the complete pairs. Method
ML2 first replaces missing values by zeros and incorporates additional
independent variables indicating the positions of the missing values.
Then the augmented logistic regression model is fitted by maximum
likelihood. Methods DF3 and WLS are candidates when estimates of
missing values are required based on all other available information.
The main feature of these two methods is that they allow for variances
resulting from errors due to using approximations instead of the
actual values of the independent variables.

In  practice,  the  choice  of  procedure  depends  heavily  on  three 
factors: (1) the need to estimate the missing values; (2) availability
of computer programs; (3) execution time. Based on his
accumulated  empirical  experience,  the  author  would  like  to  recommend 
using  methods  DF2  or  ML2  in  conjunction  with  either  DF1  or  ML1  for 
estimating  the  logistic  regression  parameters  from  incomplete  data. 
Comparing  the  results  based  on  the  complete  observations  with  those 
derived  by  either  method  DF2  or  ML2  allows  one  to  test  for  possible 
selectivity  bias  that  may  exist.  It  also  provides  a  good  sensitivity 
check  on  the  estimates  of  the  coefficients. 


ACKNOWLEDGMENTS 


The  author  wishes  to  thank  Rand  colleague  Gus  Haggstrom  for 
providing  helpful  reviews  and  editorial  comments  on  the  draft  of 
this  paper.  Special  thanks  are  due  to  Helen  Rhodes  for  her  typing, 
and  to  Becky  Goodman  for  editing  the  final  copy. 




CONTENTS 

PREFACE

SUMMARY

ACKNOWLEDGMENTS

SECTION

1. INTRODUCTION

2. DESCRIPTION OF THE METHODS

3. DISCUSSION

REFERENCES




1. INTRODUCTION


Let $(X_1,Y_1), (X_2,Y_2), \ldots, (X_n,Y_n)$ be a random sample from a
population $\Pi$ such that $Y_i$ is 1 or 0 according as the $i$th
individual in the sample belongs to some population $\Pi_1$ or its
complement $\Pi_0$. The model of interest is one that relates this
dichotomous (quantal) dependent variable, $Y$, to one or more independent
variables, $X_1, X_2, \ldots, X_p$, by a logistic regression function,

$$E(Y \mid X) = P[Y = 1 \mid X] = \frac{1}{1 + e^{-(\alpha + X'\beta)}} \tag{1}$$

where $X' = (X_1\ X_2\ \ldots\ X_p)$ and $\beta' = (\beta_1\ \beta_2\ \ldots\ \beta_p)$.

The estimators of the coefficients for this model have been
studied by several authors. A solution of a "classification" or
"discrimination" problem in which an object with given
characteristics is to be classified into one of two alternative
populations provides one set of estimators for the "logistic
regression" problem. The discriminant function solution turns out to
be equivalent to the maximum likelihood estimators derived for
the logistic regression under the assumptions of normality for
the X's and equal covariance matrices for the two distributions.
If we let $p_j$, $j = 0,1$, denote the proportion of the population in
$\Pi_j$; $\mu_j$ denote the mean vector of $X$ in $\Pi_j$; $\Sigma$ denote
the common covariance matrix of $X$; $n_j$ denote the number of
observations from $\Pi_j$; $\bar{x}_j$ denote the sample mean vector of
the $n_j$ individuals from $\Pi_j$; and $S$ denote the pooled sample
covariance matrix of the X's across the two subpopulations, the maximum
likelihood estimates of the parameters are

$$\hat{\alpha} = \log(\hat{p}_1/\hat{p}_0) - \hat{\beta}'(\bar{x}_1 + \bar{x}_0)/2, \qquad \hat{\beta} = S^{-1}(\bar{x}_1 - \bar{x}_0). \tag{2}$$

Here, $\hat{p}_j = n_j/n$, $\bar{x}_j$ and $S$ are the maximum likelihood
estimates of $p_j$, $\mu_j$, and $\Sigma$ respectively. These estimates
are usually referred to as the linear discriminant function estimates
(LDFE).
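
As an illustration of Eq. (2), the following is a minimal sketch in Python with NumPy. The function name `ldfe`, the simulated data, and the divisor n - 2 for the pooled covariance matrix are assumptions made for this example; they are not taken from the Note.

```python
import numpy as np

def ldfe(X, y):
    """Linear discriminant function estimates (alpha, beta) as in Eq. (2)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X1, X0 = X[y == 1], X[y == 0]
    n1, n0 = len(X1), len(X0)
    xbar1, xbar0 = X1.mean(axis=0), X0.mean(axis=0)
    # Pooled within-group sum of squares and cross products.
    A = (X1 - xbar1).T @ (X1 - xbar1) + (X0 - xbar0).T @ (X0 - xbar0)
    S = A / (n1 + n0 - 2)          # assumed divisor for the pooled covariance
    beta = np.linalg.solve(S, xbar1 - xbar0)
    # p_hat_j = n_j / n, so log(p_hat_1 / p_hat_0) = log(n1 / n0).
    alpha = np.log(n1 / n0) - beta @ (xbar1 + xbar0) / 2
    return alpha, beta

# Simulated two-group normal data with a common (identity) covariance matrix.
rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, size=n)
mu0, mu1 = np.array([0.0, 0.0]), np.array([1.0, -0.5])
X = rng.normal(size=(n, 2)) + np.where(y[:, None] == 1, mu1, mu0)
alpha_hat, beta_hat = ldfe(X, y)
print(alpha_hat, beta_hat)
```

Under these simulation settings the population slope vector is the difference of the two mean vectors, so `beta_hat` should land near (1, -0.5).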
Even when the normality assumptions fail to hold, many previous
studies have still considered logistic regression an appropriate
model, except that in these cases they merely assume that the
conditional distribution of $Y$ given $X = x$ has the logistic form

$$P(x) = P[Y = 1 \mid X = x] = \frac{1}{1 + e^{-(\alpha + x'\beta)}}. \tag{3}$$

In this case, many statisticians prefer to use the conditional
maximum likelihood estimators (CMLE) of $\alpha$ and $\beta$ which maximize

$$L(\alpha,\beta) = \prod_{i=1}^{n} [P(x_i)]^{y_i} [1 - P(x_i)]^{1-y_i}. \tag{4}$$
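
To make the objective in Eq. (4) concrete, here is a small sketch that evaluates its logarithm in Python with NumPy. The function name `log_likelihood` and the four-observation data set are hypothetical, chosen only for illustration.

```python
import numpy as np

def log_likelihood(alpha, beta, X, y):
    """Log of the conditional likelihood (4) under the logistic form (3)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    eta = alpha + X @ np.asarray(beta, dtype=float)
    # log P(x) = eta - log(1 + e^eta) and log(1 - P(x)) = -log(1 + e^eta),
    # so each term is y * eta - log(1 + e^eta); logaddexp keeps this stable.
    return float(np.sum(y * eta - np.logaddexp(0.0, eta)))

X = np.array([[0.5, 1.0], [1.5, -0.3], [-1.0, 0.2], [0.0, 0.0]])
y = np.array([1, 1, 0, 0])
ll_null = log_likelihood(0.0, [0.0, 0.0], X, y)   # every P(x_i) equals 1/2
ll_alt = log_likelihood(0.0, [1.0, 0.0], X, y)
print(ll_null, ll_alt)
```

The CMLE is the pair (alpha, beta) at which this function attains its maximum; any general-purpose optimizer could be applied to it.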

Arguments for choosing either one of these two estimators
for a logistic regression model, based on empirical evidence,
have been given by several authors [8,10,11,16]. From an
economic point of view, LDFE's are cheaper to obtain. In their
comparison, Halperin, Blackwelder and Verter [11] reported that
"the times required for compilation and execution of the program
were higher for the MLE method than for the discriminant function
method by factors ranging approximately from 1.3 to 2.0."
Haggstrom [10] also points out that analysts of logistic models
should be aware of the relationships between the "discrimination"
and "logistic regression" problems, if for no other reason than
to take advantage of the computational simplicity of the
discriminant function estimates when doing exploratory work in fitting
the logistic model.

In terms of asymptotic efficiency, Efron [8] shows that
typically the CMLE are between one-half and two-thirds as efficient
as LDFE when X follows a multivariate normal distribution with
equal covariance matrices. On the other hand, Press and Wilson
[16] propose that "Simulation might be used to determine the
relative efficiency of the two estimators under non-normality,
but it would not be surprising to find the sufficient estimator
(MLE) dominant." Other arguments given in favor of CMLE over LDFE
are that (1) LDFE may not be consistent, (2) the significance
associated with LDFE may be misleading when the normality
assumptions are violated, and (3) CMLE forces the expected number of
cases to be equal to the observed number of cases, which is
desirable.

The comparison becomes more complicated when one or more
values of the independent variables of certain observations are
missing, a problem that occurs quite often, especially in sample
surveys. In this paper, we consider treatment of the missing
values that occur "at random" in the logistic regression model.
A great deal of literature has been produced on handling missing
data in multiple regression and discriminant analysis
[1-5,7,9,12,13,15,19]. In practice, there are four commonly used
approaches:


1. Estimate coefficients using only the subset of complete
cases.

2. Estimate coefficients using sample moments and
correlations estimated using all available data.

3. Replace each missing value by zero (or any constant)
and create an indicator variable for each variable
denoting whether the corresponding variable is missing
or not. The coefficients are then estimated by
regressing the dependent variable simultaneously on
all the independent variables and their corresponding
indicator variables provided that there is at
least one missing entry for that variable.

4. Substitute missing values for each observation using
estimates based on all the other available information.
The coefficients are then estimated using all
the complete and completed observations. The
substitutions are frequently obtained either by the
zero order regression method [1] or the first order
regression method [1,2,9,13,15,19].

It should be noted that many other methods have been
proposed in dealing with specific situations. Even among
the four general approaches mentioned above, many variations
of the methods have been suggested. Some of them are "quick
and dirty," some "simple but inconsistent," and some
"complicated and costly even though theoretically more
preferable." Depending on the pattern of missing values and the
nature of the independent variables, no method seems to suit
all cases. Six methods for generating estimators of the
coefficients for a logistic regression model are considered
in Section 2. Three of them are related to linear
discriminant function estimates, two of them carry the idea of
conditional maximum likelihood estimates, and the last one is a
proposed weighted least squares (WLS) method resulting from
linearization of the conditional probability of Y given
X = x. General discussions on the choice of using these
methods are given in Section 3.


2. DESCRIPTION OF THE METHODS


The  six  methods  considered  in  this  paper  can  be 
described  as  follows: 

Method DF1: Discriminant Function Estimation Using
Complete Observations. All observations with one or more
missing values are omitted from analysis. The linear
discriminant function estimate is calculated as usual
according to Eq. (2) with sample sizes reduced.
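
The complete-case selection behind method DF1 can be sketched as follows, assuming missing entries are coded as NaN in a NumPy array. The helper name `complete_cases` and the toy data are hypothetical.

```python
import numpy as np

def complete_cases(X, y):
    """Drop every observation with at least one missing (NaN) covariate."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    keep = ~np.isnan(X).any(axis=1)
    return X[keep], y[keep]

X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0], [6.0, np.nan]])
y = np.array([0, 1, 1, 0])
Xc, yc = complete_cases(X, y)
# Reduced group sizes that would enter Eq. (2).
n1, n0 = int((yc == 1).sum()), int((yc == 0).sum())
print(Xc, yc, n1, n0)
```

The reduced arrays `Xc` and `yc` would then be passed to whatever routine computes the estimates in Eq. (2).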

Method DF2: Discriminant Function Estimation Using
Existing Pairs of Values for Correlations. [a] The attempt
here is to utilize all available information to improve the
estimation. In calculating the sample mean and sample
variance for each variable, all observed values for that
variable are used. In estimating covariances, one first
estimates correlations using all complete pairs of observations
and then estimates covariances by multiplying the sample
correlations by the corresponding sample standard
deviations. The estimated covariance matrix formed in this way
can then be used to calculate the linear discriminant
function estimate. Since this procedure produces consistent
estimates of means and covariances, it follows that
discriminant function estimates are also consistent. This method
is preferable to method DF1 when a large proportion of
observations have a small number of missing entries.

[a] A more commonly used approach called "pairwise deletion"
attempts to estimate the covariances from all complete pairs of
observations.
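
The moment calculations described for method DF2 can be sketched as below for a single data matrix with NaN marking missing entries. The function name `df2_moments` is hypothetical, and the sketch stops short of pooling across the two groups, since the Note does not spell out that step for this method.

```python
import numpy as np

def df2_moments(X):
    """Means and variances from all observed values of each variable;
    correlations from the complete pairs of each pair of variables."""
    X = np.asarray(X, dtype=float)
    p = X.shape[1]
    mean = np.nanmean(X, axis=0)
    sd = np.sqrt(np.nanvar(X, axis=0, ddof=1))
    R = np.eye(p)
    for j in range(p):
        for k in range(j + 1, p):
            both = ~np.isnan(X[:, j]) & ~np.isnan(X[:, k])
            R[j, k] = R[k, j] = np.corrcoef(X[both, j], X[both, k])[0, 1]
    # Covariance = correlation times the two standard deviations.
    S = R * np.outer(sd, sd)
    return mean, S

X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [np.nan, 4.0]])
mean, S = df2_moments(X)
print(mean)
print(S)
```

One design caveat worth keeping in mind: a covariance matrix assembled this way is not guaranteed to be positive definite, so a practical implementation would want to check that before inverting it.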

Method DF3: Discriminant Function Estimation Adjusting
for Residual Covariances. Buck [2] suggested a method of
estimating missing values in the sample by regression
techniques using only the complete observations. For
observations with $v$, $1 \le v \le p-1$, variables missing, one calculates
the multiple regression for each missing variate on the
remaining $p-v$ variates and then estimates the missing value
by the fitted value obtained from the appropriate regression
function. The auxiliary regressions are computed separately
for each value (zero or one) of the dependent variable.
However, once the sample has been completed by filling in missing
values, the pooled sample covariance matrix becomes an
inconsistent estimate of the population covariance matrix.
Hence, in order to get consistent estimates of the logistic
regression coefficients, one needs to adjust for "residual
covariances." Little [12] suggests that one first form
$A = \{a_{jk}\}$, the pooled sum of squares and cross products
matrix of the combined complete and completed observations,
and then adjust it as follows. For each observation where
$x_j$ and $x_k$ are both missing, add to $a_{jk}$ an estimate of the
residual covariance (variance if $j = k$) of $x_j$ and $x_k$ given
the variables present in that observation. This estimate is
derived by pooling the estimated covariance matrices over
two sets of complete observations, one for each value of the
dependent variable. If $A$ now denotes the adjusted matrix,
substituting $A/(n-p-1)$ for $S$ in (2) yields a consistent set of
estimators.
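
The fill-in and adjustment steps of method DF3 can be sketched as follows. This is a simplified illustration for a single group, not the Rand program: the function name `buck_complete` is hypothetical, the auxiliary regressions are obtained from the complete-case mean and covariance matrix (which gives the same fitted values as least squares with an intercept), and the residual covariance uses that same complete-case covariance rather than a pooled two-group estimate.

```python
import numpy as np

def buck_complete(X):
    """Fill NaN entries by regression on the observed variables and return
    the completed matrix plus the summed residual-covariance adjustment."""
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    cc = X[~miss.any(axis=1)]               # complete observations only
    mu = cc.mean(axis=0)
    C = np.cov(cc, rowvar=False)            # complete-case covariance (p >= 2)
    Xf = X.copy()
    adj = np.zeros_like(C)
    # Assumes each incomplete row has at least one observed variable.
    for i in np.flatnonzero(miss.any(axis=1)):
        m, o = miss[i], ~miss[i]
        # Coefficients of the regression of missing on observed variables.
        B = np.linalg.solve(C[np.ix_(o, o)], C[np.ix_(o, m)])
        Xf[i, m] = mu[m] + (X[i, o] - mu[o]) @ B
        # Residual covariance of the missing block given the observed block.
        adj[np.ix_(m, m)] += C[np.ix_(m, m)] - C[np.ix_(m, o)] @ B
    return Xf, adj

rng = np.random.default_rng(1)
Z = rng.normal(size=(60, 3))
Z[:, 1] += 0.8 * Z[:, 0]
Z[0, 1] = np.nan
Z[1, [0, 2]] = np.nan
Zf, adj = buck_complete(Z)
D = Zf - Zf.mean(axis=0)
A_adjusted = D.T @ D + adj                  # adjusted SSCP matrix for one group
print(A_adjusted)
```

In the two-group setting of the Note, the same steps would be run within each value of the dependent variable and the resulting matrices pooled before forming the divisor.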

Method ML1: Maximum Likelihood Estimation Using
Complete Observations. Maximum likelihood estimates which
maximize Eq. (4) are calculated using only the subset of
complete observations.

Method ML2: Maximum Likelihood Estimation with
Indicator Variables for Missing Data. When data are missing on
some variables, instead of omitting the observations with at
least one missing entry from the analysis, each missing
value can be replaced by zero. To account for this
replacement on each incomplete variable, create an indicator
variable to designate the missing pattern of that variable [17].
The indicator variable takes on a value of 1 if the
associated independent variable has a missing entry and 0
otherwise. The CMLE will then be obtained from the logistic
regression model with all the independent variables and the
formed indicator variables included simultaneously on the
right-hand side of Eq. (3). As an extension of
this methodology, more than one indicator variable can be
created for each incomplete variate by interacting the
missing designator with other characteristics that are judged
important. This procedure has the advantage of
computational simplicity, it uses all available information, and it
provides estimates of the missing values which can be used
to examine the hypothesis that data are missing at random.
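
The augmented design matrix used by method ML2 can be sketched as follows, again assuming NaN marks a missing entry. The function name `ml2_design` is hypothetical; the resulting matrix would be handed to any conditional maximum likelihood routine.

```python
import numpy as np

def ml2_design(X):
    """Replace missing values by zero and append one 0/1 indicator column
    for each variable that has at least one missing entry."""
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    has_missing = miss.any(axis=0)
    X_filled = np.where(miss, 0.0, X)
    indicators = miss[:, has_missing].astype(float)
    return np.column_stack([X_filled, indicators])

X = np.array([[1.0, np.nan], [2.0, 3.0], [4.0, 5.0]])
design = ml2_design(X)
print(design)
```

Here only the second variable has a missing entry, so a single indicator column is appended. Interaction indicators of the kind mentioned above could be added by multiplying an indicator column by another covariate.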

Method WLS: Weighted Least Squares Estimation after
Linearizing the Conditional Probability. When data are
missing at random, one possible approach is to estimate the
missing values. As described in DF3, Buck [2] suggested a
method of estimating missing values for each observation
using the appropriate regression functions of the missing
variables on all the available variables for that
observation, where the auxiliary regression coefficients are
estimated from the subset of all complete observations.
This substitution introduces an additional approximation
error into the equation which should be taken into account
in the analysis. Walker and Duncan [18] propose a weighted
least squares solution to the estimation of Eq. (3) which,
as they say, is equivalent to estimation of the parameters
in (3) by maximum likelihood when the data are complete.
Following their approach with linearization of the
conditional probability in obtaining a linear formulation, we
shall now treat the model, when data are missing, as if it
were conditional only on the observed values with the
missing data replaced by some linear function of the observed
values. The errors induced by such approximations will then
be incorporated with the error of the model. Under the
assumption of no pairwise correlation between the completed
independent variables, the approximation errors and the
error of the model, one can then derive a weighted least
squares solution to the problem.

For the purposes of this discussion, the success
probability $P_i$ for the $i$th individual will be represented in the
form

$$P_i = P[Y_i = 1 \mid X_i = x_i] = f(x_i, \beta) = \{1 + \exp(-x_i'\beta)\}^{-1}$$

where $\beta$ and $x_i$ are now $(p+1)$-dimensional column vectors:
$\beta = (\alpha\ \beta_1\ \ldots\ \beta_p)'$ and $x_i = (1\ x_{1i}\ \ldots\ x_{pi})'$.
Following the development of Walker and Duncan in [18], we consider the
model in the form

$$Y_i = f(x_i, \beta) + \varepsilon_i, \tag{5}$$

where

$$E(\varepsilon_i) = 0, \qquad \mathrm{Var}(\varepsilon_i) = P_i(1 - P_i) = P_i Q_i, \qquad 1 \le i \le n.$$

Expanding $f$ in a Taylor series around some initial guessed
value of $\beta$, say $\tilde{\beta}$, we obtain an approximation to (5) which
can be written in matrix form as

$$Y^* = X^*\beta + \varepsilon \tag{6}$$

where $\varepsilon$ and $Y^*$ are $n \times 1$ vectors with elements $\varepsilon_i$ and

$$Y_i^* = Y_i - \tilde{P}_i + \tilde{P}_i \tilde{Q}_i x_i'\tilde{\beta}.$$

Here, $X^*$ is the $n \times (p+1)$ matrix having $x_i^{*\prime} = \tilde{P}_i \tilde{Q}_i x_i'$ as its
$i$th row, $\tilde{P}_i = f(x_i, \tilde{\beta})$, and $\tilde{Q}_i = 1 - \tilde{P}_i$.

If the approximation (6) were an equality, the best
linear unbiased estimator of $\beta$ would be the weighted least
squares estimator

$$\hat{\beta} = (X^{*\prime} W X^*)^{-1} X^{*\prime} W Y^* \tag{7}$$

where $W$ is the diagonal matrix of weights $w_{ii} = 1/(P_i Q_i)$.
Walker and Duncan considered using (7) as a means for
providing an iterative solution of the normal equations,
thereby identifying the solution as the value of the
(conditional) maximum likelihood estimator $\hat{\beta}$.
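
The iteration built on Eqs. (5) to (7) can be sketched for complete data as follows. The function name `walker_duncan`, the zero starting value, the stopping rule, and the simulated data are assumptions for this example; the weights are evaluated at the current guess of the coefficients.

```python
import numpy as np

def walker_duncan(X, y, n_iter=50, tol=1e-10):
    """Iterate the weighted least squares step (7) until it stabilizes."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Z = np.column_stack([np.ones(len(y)), X])   # x_i = (1, x_1i, ..., x_pi)'
    beta = np.zeros(Z.shape[1])                 # initial guess (beta tilde)
    for _ in range(n_iter):
        P = 1.0 / (1.0 + np.exp(-Z @ beta))
        PQ = P * (1.0 - P)
        Xs = Z * PQ[:, None]                    # rows x_i*' = P_i Q_i x_i'
        Ys = y - P + PQ * (Z @ beta)            # elements Y_i* of Eq. (6)
        w = 1.0 / PQ                            # diagonal of W
        new = np.linalg.solve(Xs.T @ (Xs * w[:, None]), Xs.T @ (w * Ys))
        converged = np.max(np.abs(new - beta)) < tol
        beta = new
        if converged:
            break
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
p_true = 1.0 / (1.0 + np.exp(-(-0.5 + X @ np.array([1.0, -1.0]))))
y = (rng.random(400) < p_true).astype(float)
beta_hat = walker_duncan(X, y)
print(beta_hat)
```

At convergence the fitted probabilities sum to the observed number of cases, which is the property cited in Section 1 as argument (3) in favor of the CMLE.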

When data are incomplete, one approach is to replace the
missing values $x_{ik}$ by some estimated
values $\hat{x}_{ik}$. These $\hat{x}_{ik}$ are computed from the appropriate
regression function depending on the available information
on $x_i$. The parameters of these auxiliary regressions can be
estimated from the subset of complete observations. With
the missing values of each incomplete variable replaced, one
can then apply the Duncan-Walker procedure to the completed
data matrix to derive estimates that are analogous to the
conditional maximum likelihood estimates for complete data.

The proposed WLS approach in the presence of missing
data begins by rewriting (6) in the form

$$\hat{Y}^* = \hat{X}^*\beta + w. \tag{8}$$

Ignoring all terms of order greater than one in the
approximation, the $i$th coordinate of $w$ is

$$w_i = \varepsilon_i + (x_i^* - \hat{x}_i^*)'(2\tilde{\beta} - \beta) = \varepsilon_i + \tilde{P}_i \tilde{Q}_i \sum_{k \in M_i} (2\tilde{\beta}_k - \beta_k) u_{ik}$$

where $u_{ik} = x_{ik} - \hat{x}_{ik}$ and $M_i$ indexes the variables missing
for the $i$th observation. Assuming the covariance between the
last two terms above is negligible, we have that

$$\mathrm{Var}(w_i) = \mathrm{Var}(\varepsilon_i) + (\tilde{P}_i \tilde{Q}_i)^2 \sum_{j \in M_i} \sum_{k \in M_i} (2\tilde{\beta}_j - \beta_j)(2\tilde{\beta}_k - \beta_k)\sigma_{jk} \tag{9}$$

where $\sigma_{jk} = \mathrm{Cov}(u_{ij}, u_{ik})$. Since the terms $\sigma_{jk}$ can be
estimated using the residuals $x_{ik} - \hat{x}_{ik}$ from the regressions
on the complete cases, the covariance matrix of $w$ can be
estimated using the diagonal matrix

$$\Sigma_w = B + L(\beta)$$

where $B$ is a diagonal matrix with diagonal elements being
estimates of $P_i Q_i$, and the $i$th diagonal element of $L(\beta)$ is
an estimate of the second term on the right in (9) that
incorporates estimates of $\sigma_{jk}$ and $\beta$. Hence, $\Sigma_w$ is itself
a function of $\beta$.
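
The variance in Eq. (9) for a single observation can be sketched as follows. The function name `var_w` and its arguments are hypothetical; `beta` and `beta_tilde` hold only the slope coefficients (the intercept's regressor is never missing), `resid_cov` is an estimate of the residual covariances, and `missing` is a boolean mask marking the variables missing for that observation.

```python
import numpy as np

def var_w(P, beta, beta_tilde, resid_cov, missing):
    """Var(w_i) from Eq. (9): P Q plus the squared-PQ-weighted quadratic
    form in (2 * beta_tilde - beta) over the missing variables."""
    beta = np.asarray(beta, dtype=float)
    beta_tilde = np.asarray(beta_tilde, dtype=float)
    resid_cov = np.asarray(resid_cov, dtype=float)
    missing = np.asarray(missing, dtype=bool)
    c = (2.0 * beta_tilde - beta)[missing]
    S = resid_cov[np.ix_(missing, missing)]
    PQ = P * (1.0 - P)
    return float(PQ + PQ**2 * (c @ S @ c))

beta = np.array([1.0, 2.0])
v_complete = var_w(0.5, beta, beta, np.eye(2), [False, False])
v_missing = var_w(0.5, beta, beta, np.eye(2), [True, False])
print(v_complete, v_missing)
```

With no variable missing the second term vanishes and the variance reduces to P Q, as in the complete-data model (5). The reciprocals of these variances would serve as the weights for the incomplete-data estimator.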

The proposed weighted least squares estimate for
incomplete data is given by the equation

$$\hat{\beta}_M = (\hat{X}^{*\prime} \Sigma_w^{-1} \hat{X}^*)^{-1} \hat{X}^{*\prime} \Sigma_w^{-1} \hat{Y}^*. \tag{10}$$

This equation represents a system of $(p+1)$ simultaneous
non-linear equations in the $(p+1)$ unknown elements of $\beta$
which can be used to determine an iterative solution
analogous to the Duncan-Walker solution for determining the
maximum likelihood estimator.

It can be shown that if the approximation (8) were an
equality, the solution to (10) would be a consistent estimator of
$\beta$ as long as $\tilde{\beta}$ is consistent. The proof is similar to the
idea outlined in the Appendix of [7]. Nonetheless, one
cannot assure that this solution $\hat{\beta}_M$ is also a consistent
estimator of $\beta$ in (5) without further investigation. However,
if the proportion of missing data is small, we believe that
even if it is not consistent, the asymptotic bias should be
small. At least in the case when no data are missing, the
consistency of $\hat{\beta}$ is well established. Hence, any efficient
optimization procedure applied to (10) should converge after
several iterations if some good initial estimate $\tilde{\beta}$ is used.


3. DISCUSSION


When parameters of logistic regression are estimated
from data which contain incomplete information, several
methods can be used. Some of the procedures may be simple
but inconsistent, such as the linear discriminant function
estimators in the non-normal case; some may be complicated
to compute, such as the DF3 and WLS estimators. However,
when the data are not missing in any systematic fashion, it
appears from empirical evidence that the differences in the
estimates are usually not large. One such application of
all these methods to a numerical example can be found in
[6].

Budget constraints and availability of computer
programs normally constrain the number of alternative
approaches. The decision as to which method to use depends
also on the missing pattern of the variables and the need
for estimating missing values.

In practice, the choice of estimation depends heavily
on three dominant factors. First, it depends on whether the
researcher wants to estimate missing values. In some
situations, derived scores are required for each subject. In
such cases, it is more appropriate to use either method DF3
or WLS for estimating the parameters. Second, execution
time plays an important role in selecting which estimation
method to use. DF1 is most efficient in this sense; the
others can take as much as four times as long to compute.
Third, availability of computer programs that handle
missing values also confines the type(s) of estimation methods
one can apply. Obviously, method DF1 can be performed
easily using any multiple regression package. If a
conditional maximum likelihood program is also readily available
[14,18], methods ML1 and ML2 can also be used. By specifying
the CORPAIR option in program BMDP8D, one can obtain the
correlation matrix needed for computing estimator DF2.
Currently, computer programs exist at The Rand Corporation
for performing methods DF3 and WLS, but these are not easily
adapted to other computer facilities. [a]

Based on his accumulated empirical experience, the
author strongly recommends using either method
DF2 or ML2 for estimating the parameters in logistic
regression with incomplete data. Even if the assumption that data
are missing at random is not found to be violated, it is
still desirable to compare the results based on the complete
observations with those derived by either method DF2 or ML2
utilizing all the available information. It is not
surprising to find different effects being shown for a few variables
in many data sets containing missing entries. If this
happens, further investigations for any possible selectivity
bias that may exist are needed. Even if similar results are obtained
using the two methods, the runs provide a sensitivity check on the
estimates of the coefficients in the logistic regression model.

[a] The programs rely heavily on STATLIB, a statistical
computing library developed at Bell Laboratories and at Rand. Since
this library is not yet ready for wide distribution, the programs
are also not yet available for general use.